What is this Analysis?

This notebook details my approach to building predictive models for newly released games on boardgamegeek.com (BGG). Specifically, I am interested in taking any newly released boardgame and, using features that are available at the time of its release, estimating how it will be received on BGG: its average rating, number of user ratings, and complexity rating.

While the goal of this project is ultimately to yield accurate predictions for upcoming games, we are also interested in understanding what the models learn. What features of games are associated with high/low average rating? Why do some games receive high numbers of user ratings? What types of games are the most complex?

To answer these questions, we’ll make use of historical data from boardgamegeek. We will connect to a database on GCP containing a variety of tables on game features and their current ratings on BGG. For this analysis, we will restrict model training to games published through 2019, and we will validate our models by evaluating their predictions for games published in 2020.

1 The Data

The data we are using comes from boardgamegeek.com, which we access using the open BGG API. We are training models on data that was last pulled from BGG on 2022-02-21.

We will be training models at the game level, where every row corresponds to one game and every column corresponds to a feature of that game. As of this date, there are 22071 games on boardgamegeek with at least 30 user ratings.

For the purpose of this analysis, the training set will only include games published prior to 2020 that have achieved at least 100 user ratings. This design decision restricts our sample to games that have received some evaluation from the community, and it also speeds up model training. We can later treat this threshold as a tuning parameter, allowing more or fewer historical games to enter the training set. Based on some initial tests, 100 was a useful cutoff point for both model performance and training time.

1.1 Outcomes

We are interested in modeling a number of different outcomes: a game’s average rating, complexity rating, and number of user ratings.

These outcomes aren’t independent, as complexity and the average rating are highly correlated.

As we will see, this means that if we want to predict a game’s average rating, the most important feature is usually its average weight (complexity). But because a game’s average rating and complexity are both voted on by the BGG community, we won’t know either at the time of its release. For newly upcoming games, we will therefore first use a model to estimate a game’s complexity and then use that estimate as an input into our average rating model.
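This two-stage approach can be sketched as a small helper; `complexity_fit` and `rating_fit` are hypothetical fitted workflows standing in for the models trained later in this notebook:

```r
# A sketch (not the final implementation) of the two-stage prediction:
# `complexity_fit` and `rating_fit` stand in for fitted workflows.
predict_new_game <- function(new_game, complexity_fit, rating_fit) {
        # stage 1: estimate complexity from features known at release
        est_weight <- predict(complexity_fit, new_data = new_game)$.pred
        # stage 2: plug the estimate in as the averageweight feature
        new_game$averageweight <- est_weight
        predict(rating_fit, new_data = new_game)
}
```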

1.2 Features

What features do we have about games? We have basic information about every game, such as its player count and playing time, and we also have many BGG outcomes, such as the number of comments and the number of people trading the game, which we will not use in predicting the outcomes we care about. There is some missingness in the playing time variables that we will address in our recipe preparing the data.

1.3 Handling Categorical Features

We also have a variety of information about game mechanics, categories, artists, publishers, designers, and so on. Some of these categories are not observed for every game, such as when a game doesn’t have expansions or integrations with other games.

There are ~180 different mechanics, ~7k publishers, ~10k designers, and ~11k artists present in our training set. This is good in the sense that we have ample information about games for models to use in training, but bad in the sense that if we threw all of it into a model we would quickly run up against the curse of dimensionality.

How can we make use of this information for modeling? We could create dummy variables for every different type, but this would quickly create thousands of features, many of which contain little information. We could view this as a P > N problem and let the data speak for itself via feature selection and dimension reduction.

Alternatively, if every game had only one mechanic/designer/publisher, we could mean encode these features on the training set. For instance, instead of using thousands of dummy variables for each designer, we would have one ‘designer_mean’ feature that is simply the designer’s mean outcome value in the training set. This can dramatically reduce the dimensionality of categorical features while keeping the information we want.
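As a hedged sketch of mean encoding (with toy data standing in for a long-format designer table; the column names here are assumptions), averaging the designer means also gives one way to handle games with multiple designers:

```r
library(dplyr)

# Toy long-format table (hypothetical values): one row per (game, designer),
# with the game's average rating joined on.
designer_games <- tibble::tibble(
        game_id  = c(1, 2, 3, 3),
        designer = c("Designer A", "Designer A", "Designer A", "Designer B"),
        average  = c(7.0, 8.0, 6.0, 6.0)
)

# mean encode on the training set: each designer's mean outcome value
designer_means <- designer_games %>%
        group_by(designer) %>%
        summarise(designer_mean = mean(average), .groups = "drop")

# collapse back to one row per game, taking the mean of the designer
# means for games with multiple designers
game_encoding <- designer_games %>%
        left_join(designer_means, by = "designer") %>%
        group_by(game_id) %>%
        summarise(designer_mean = mean(designer_mean), .groups = "drop")
```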

For our purposes, the hang-up with a simple mean encoding approach is that a game may have multiple designers, categories, mechanics, artists, and publishers. For designers we might get by with taking the mean of the designer means, but it gets more complicated with mechanics - most games have multiple mechanics, and it’s the combination of different mechanics that we are interested in exploring. The other complication is that some designers have only designed a handful of games while others have designed hundreds, so the mean may not impart the same amount of information.

On top of all of this, we have to be careful in what features we allow to enter a model, as some of the categories about games are themselves a reflection of the outcomes we want to predict.

With all this in mind, we’ll do a bit of inspection to figure out which features of games we’ll allow to enter our training recipe, in essence using a manual filtering method to select features.

1.3.1 Families

One set of features relates to a game’s “family”, which is a sort of catch-all term for various buckets that games might fall into: Kickstarters, dungeon crawls, tableau builders, etc. Some of these are likely to be very useful in training a model, while others should be omitted. We don’t, for instance, want to include whether a game has digital implementations, as these are a reflection of a game’s popularity. These features also have a very long tail, with some families containing only one or two games. We’ll apply a near-zero-variance filter, removing families that apply to a little less than 1% of games.
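As a rough sketch of this filtering step (toy data; the threshold is inflated here so the toy example does something, while the real data uses roughly 1%, and `family_games` is an assumed long-format table):

```r
library(dplyr)

# Toy long-format table (hypothetical): one row per (game, family)
family_games <- tibble::tibble(
        game_id = 1:6,
        family  = c("Kickstarter", "Kickstarter", "Kickstarter",
                    "Dungeon Crawl", "Dungeon Crawl", "One-Off Family")
)

n_games  <- n_distinct(family_games$game_id)
min_prop <- 1 / 3  # ~1% in the real data; inflated for this toy example

# keep only families that appear in at least min_prop of games
selected_families <- family_games %>%
        count(family, name = "n_in_family") %>%
        filter(n_in_family / n_games >= min_prop)
```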

Some features we won’t include at all, such as the Mensa Select award or implementation on BoardGameArena, as these are outcomes that typically occur once a game has become popular and shouldn’t be used as predictors.

1.3.2 Categories

We’ll do the same thing for categories, but this variable is much smaller and generally pretty well organized.

We’ll include all of these, though there will likely be some overlap between these and other features which we can take care of with a correlation filter.

1.3.3 Mechanics

Mechanics are also pretty well organized, so we don’t have to do much filtering.

We’ll just keep all of the mechanics, as these are the main features of games that we’ll focus our attention on.

1.3.4 Designers and Artists

How should we handle artist and designer effects? We’ll use a much lower minimum proportion here, as very few designers have designed ~100 games.

This amounts to allowing for designers once they have released about 15 games. We’ll more or less take the same approach for artists.

1.3.5 Publishers

Publishers are a bit more tricky. If we look at the top rated publishers, we’ll notice something a bit odd. Some of the publishers we’ll recognize, but we also see some names that might not make a lot of sense. Why are Asmodee Italia and Galapagos towards the top? The reason is foreign-language publishers - once a game becomes popular enough, it ends up being published in other languages. This means certain publishers are a reflection of the outcomes we are trying to predict (the average and bayes average) and shouldn’t be used as predictors in models of these outcomes.

1.3.6 Publisher “White-List”

So what should we do? I’ve settled on creating a ‘white-list’ of publishers: those that have been the original publisher of popular games.
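A sketch of how such a white-list could be built, assuming a long-format publisher table where the original publisher is the one attached to the game’s earliest printing (toy data, assumed column names, and an assumed popularity cutoff):

```r
library(dplyr)

# Toy table (hypothetical): one row per (game, publisher) printing
publisher_games <- tibble::tibble(
        game_id       = c(1, 1, 2),
        publisher     = c("Original Publisher", "Asmodee Italia", "Another Publisher"),
        yearpublished = c(2016, 2018, 2017),
        bayesaverage  = c(8.0, 8.0, 7.5)
)

# white-list: publishers that were the original (earliest) publisher of at
# least one popular game; the bayesaverage >= 7 cutoff is an assumption
publisher_whitelist <- publisher_games %>%
        group_by(game_id) %>%
        slice_min(yearpublished, n = 1, with_ties = FALSE) %>%
        ungroup() %>%
        filter(bayesaverage >= 7) %>%
        distinct(publisher)
```

Note that a later foreign-language printing (here “Asmodee Italia”) never enters the white-list, which is exactly the leakage we want to avoid.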

1.4 Assemble Data

Putting this all together, we will keep the selected categorical features, creating dummy variables for each, which we will then pare down through a combination of near-zero-variance and correlation filters before modeling, ultimately conducting feature selection within the modeling process.

# combine into one table
categorical_features_selected = rbindlist(mget(ls(pattern = "selected_"))) %>%
        as_tibble() %>%
        mutate(selected = "yes")

# select in full game types set
game_types_selected = game_types %>%
        left_join(., categorical_features_selected %>%
                          select(type, id, value, tidied, selected),
                  by = c("type", "id", "value")) %>%
        filter(selected == 'yes')

# pivot and spread these out
game_types_pivoted = game_types_selected %>%
        select(game_id, type, value) %>%
        mutate(type_abbrev = substr(type, 1, 3)) %>%
        mutate(value = tolower(gsub("[[:space:]]", "_", gsub("\\s+", " ", gsub("[[:punct:]]","", value))))) %>%
        mutate(type = paste(type, value, sep="_")) %>%
        mutate(has_type = 1) %>%
        select(-value) %>%
        pivot_wider(names_from = c("type"),
                            values_from = c("has_type"),
                            id_cols = c("game_id"),
                            names_sep = "_",
                            values_fn = min,
                            values_fill = 0)

# now join
games_model = active_games %>%
        left_join(.,
                  game_types_pivoted,
                  by = "game_id") 

# remove objects we don't need
rm(train_types,
   game_types_selected,
   game_types_pivoted,
   publishers,
   categorical_features_selected,
   families,
   mechanics,
   categories,
   designers,
   artists)

rm(list = ls(pattern = "selected_"))
rm(list = ls(pattern = "summarized"))

2 Modeling Set Up

We can now proceed to building predictive models. We’ll split the data into training, validation, and test sets, then do a bit of exploratory analysis on the training set. We’ll then create recipes which we use in training and evaluating our models.

2.1 Splitting the Data

First, we’ll split the data. We’ll set up our training, validation, and test sets based on the year games are published. Our training set will be games published prior to 2020 while our main validation set will be games published in 2020. We’ll use resampling within our training set to tune our models, validating their performance on the validation set. The test set will be all games published after our validation set.

# get full dataset
games_full = games_model %>%
        mutate(dataset = case_when(yearpublished <= params$end_training_year ~ 'train',
                                   yearpublished == params$end_training_year+1 ~ 'validation',
                                   TRUE ~ 'test')) %>%
        mutate(usersrated = log(usersrated))

# filter our training set to only games with at least n ratings
games_train = games_full %>%
        filter(dataset == 'train') %>%
        filter(usersrated >= log(params$min_ratings))

games_validation = games_full %>%
        filter(dataset == 'validation')

games_test = games_full %>%
        filter(dataset == 'test')

We’re going to use the tidymodels framework for model training and evaluation, so we’ll create a custom split around these sets for our workflows.

# make an initial split based on previously defined splits
validation_split = make_splits(list(analysis = seq(nrow(games_train)),
                                 assessment = nrow(games_train) + seq(nrow(games_validation))),
                            data  = bind_rows(games_train,
                                      games_validation))

2.2 Some Quick Exploratory Analysis

We can do a bit of exploratory analysis on this to help guide decisions we’ll make in our recipe, which we’ll build on the training set. We’ll look at some of the main numeric features of games, such as player counts, playingtime, and yearpublished, for each of our outcomes.

2.2.1 Scatter Plots

2.2.2 Correlation Plots

We can look at the correlation between our outcomes, the main features we have for games (playtime, player count, complexity) and some of the categorical features.
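The underlying computation for these plots can be sketched as follows (toy data in place of the training set; the real version would select the corresponding columns of games_train):

```r
# Toy data standing in for the training set (hypothetical values)
games_toy <- data.frame(
        average       = c(7.1, 6.5, 8.0, 5.9, 7.4),
        averageweight = c(2.5, 1.8, 3.6, 1.2, 2.9),
        playingtime   = c(90, 45, 120, 30, 100),
        maxplayers    = c(4, 6, 4, 8, 5)
)

# pairwise correlations between outcomes and numeric features
cor_mat <- cor(games_toy, use = "pairwise.complete.obs")

# e.g. visualize with corrplot::corrplot(cor_mat, method = "color")
```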

What about mechanics? Let’s look at the correlation between mechanics and these numeric features.

Correlation plot of game info and game mechanics

2.3 Make Recipes

We’ll make a basic recipe, which we’ll then update for each specific outcome.

# creating recipe with no formula or outcome specified yet
base_recipe = recipe(x = games_train) %>%
        update_role(all_numeric(),
                    new_role = "predictor") %>%
        step_mutate_at(c("averageweight"),
                         fn = ~ na_if(., 0),
                       skip = T) %>% # set to skip as this will be an outcome
        step_mutate_at(c("yearpublished",
                         "playingtime"),
                       fn = ~ na_if(., 0)) %>% # these variables come through as 0 if they are missing
        update_role(one_of("timestamp",
                      "dataset",
                      "game_id",
                      "name",
                      "numcomments",
                      "numweights",
                      "owned",
                      "trading",
                      "wanting",
                      "wishing",
                      "average",
                      "bayesaverage",
                      "averageweight",
                      "usersrated",
                      "stddev"),
                      new_role = "id") %>%
        step_filter(!is.na(yearpublished)) %>%
        step_filter(!is.na(name)) %>%
        step_mutate(missing_minage = case_when(is.na(minage) ~ 1,
                                               TRUE ~ 0)) %>%
        step_mutate(missing_playingtime = case_when(is.na(playingtime) ~ 1,
                                                    TRUE ~ 0)) %>%
        step_impute_median(playingtime,
                           maxplayers,
                           minage) %>% # medianimpute numeric predictors
        # step_mutate(published_prior_1950 = case_when(yearpublished<1950 ~ 1,
        #                                                TRUE ~ 0)) %>%
        step_mutate(minplayers = case_when(minplayers < 1 ~ 1,
                                                     minplayers > 10 ~ 10, # truncate
                                                     TRUE ~ minplayers),
                    maxplayers = case_when(maxplayers < 1 ~ minplayers,
                                                     maxplayers > 20 ~ 20,
                                                     TRUE ~ maxplayers)) %>%
        step_rm(minplaytime, maxplaytime) %>%
        step_mutate(time_per_player = playingtime/ maxplayers) %>% # make time per player variable
        step_mutate_at(starts_with("category_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("mechanic_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("artist_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("designer_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate_at(starts_with("publisher_"),
                           fn = ~ replace_na(., 0)) %>%  
        step_mutate_at(starts_with("family_"),
                           fn = ~ replace_na(., 0)) %>%
        step_mutate(number_mechanics = rowSums(across(starts_with("mechanic_"))),
                  #    number_artists = rowSums(across(starts_with("art_"))),
                      number_categories = rowSums(across(starts_with("category_")))) %>%
        step_zv(all_predictors()) %>%
        step_nzv(all_predictors(),
                   -starts_with("publisher_"),
                   -starts_with("artist_"),
                   -starts_with("designer_"),
                   freq_cut = 100/1) %>% 
        step_corr(all_predictors(),
                  threshold = 0.9) %>%
        step_mutate(published_prior_1950 = case_when(yearpublished < 1950 ~ 1,
                                                     TRUE ~ 0)) %>%
        step_mutate(trunc_yearpublished = case_when(yearpublished < 1950 ~ 1950,
                                              TRUE ~ yearpublished)) %>% # truncate
        # step_mutate(cut_yearpublished= yearpublished) %>%
        # step_cut(cut_yearpublished,
        #                      breaks = seq(1970, 2010, 10),
        #                      include_outside_range = T) %>%
        step_mutate(cut_playingtime= playingtime) %>%
        step_cut(cut_playingtime,
                             breaks = c(15, 45, 90, 180),
                             include_outside_range = T) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_log(playingtime,
                   time_per_player,
                   offset = 1) %>%
        step_zv(all_predictors()) %>% # remove features with no variance
        step_nzv(all_predictors(),
                   -starts_with("publisher_"),
                   -starts_with("artist_"),
                   -starts_with("designer_"),
                   freq_cut = 100/1) %>% # apply near zero variance filter
        step_nzv(starts_with("artist_"),
                 -one_of(c("artist_ian_otoole",
                           "artist_chris_quilliams")), # allow for some specific artists, well known in recent years
                 freq_cut = 250/1) %>%
        step_corr(all_predictors(),
                  threshold = 0.9) # remove highly, highly correlated features

# save the base recipe
readr::write_rds(base_recipe,
                 here::here("models",
                            paste("base_recipe",
                                  "_", Sys.Date(), ".Rdata", sep="")))

For modeling the BGG average, bayesaverage, and usersrated, we’ll include averageweight as a feature and address missingness within the recipe with a simple model.

# average
recipe_average = base_recipe %>%
        update_role(average,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                     #      impute_with = imp_vars(all_predictors())) %>%
                        impute_with = imp_vars(
                                playingtime,
                                time_per_player,
                                number_mechanics))
# bayesaverage
recipe_bayesaverage = base_recipe %>%
        update_role(bayesaverage,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                     #      impute_with = imp_vars(all_predictors())) %>%
                        impute_with = imp_vars(
                                playingtime,
                                time_per_player,
                                number_mechanics)) # model complexity via a simple linear model of playingtime and mechanics

# usersrated
recipe_usersrated = base_recipe %>%
        update_role(usersrated,
                    new_role = "outcome") %>%
        update_role(averageweight,
                    new_role = "predictor") %>%
        step_impute_linear(averageweight,
                     #      impute_with = imp_vars(all_predictors())) %>%
                        impute_with = imp_vars(
                                playingtime,
                                time_per_player,
                                number_mechanics))

For modeling complexity (averageweight), we’ll trim the dataset further, omitting games that haven’t received enough votes on their complexity.

recipe_averageweight = base_recipe %>%
        update_role(averageweight,
                    new_role = "outcome") %>%
        step_filter(numweights > 10)

2.4 Define Models and Workflows

We’ll use a few different methods in modeling our outcomes. I’ll mostly rely on penalized regression (glmnet) and gradient-boosted trees (xgbTree), but I’ll also explore using Bayesian linear models with Stan.

# penalized linear regression
glmnet_reg_mod<- 
  linear_reg(penalty = tune::tune(),
             mixture = 0.5) %>%
  set_engine("glmnet")

# specify grid for tuning
glmnet_grid <- tibble(penalty = 10^seq(-4, -0.5, 
                                       length.out = 30))

# xgbtree for regression
xgbTree_reg_mod <-
  parsnip::boost_tree(
    mode = "regression",
    trees = 250,
    sample_size = tune::tune(),
    min_n = tune::tune(),
    tree_depth = tune::tune()) %>%
  set_engine("xgboost",
             objective = "reg:squarederror")

# xgbTree grid
xgbTree_grid <- 
  expand.grid(
          sample_size = c(0.5, 0.75, 0.95),
          min_n = c(5, 15, 25),
          tree_depth = 3
  )

# stan linear regression
set.seed(123)
prior_dist <- rstanarm::student_t(df = 1) # student t prior
 
# lm with stan
stan_reg_mod <-   
  linear_reg() %>% 
  set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist,   
             iter = 8000)

We then create workflows for each model, for each outcome.

For the models that rely on tuning parameters (glmnet and xgbTree) we’ll define a cross validation split within the training set and tune across resamples.
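The shared objects referenced in the tuning chunks below (`train_folds`, `reg_metrics`, `keep_pred`) and one example workflow could be defined along these lines; the number of folds and the exact metric set here are assumptions, not the notebook’s recorded settings:

```r
library(tidymodels)

# cross-validation folds within the training set (v = 5 is an assumption)
set.seed(1999)
train_folds <- vfold_cv(games_train, v = 5)

# regression metrics and a control object that keeps resampling predictions
reg_metrics <- metric_set(rmse, rsq, mae)
keep_pred   <- control_grid(save_pred = TRUE)

# one example workflow: glmnet on the average-rating recipe
average_glmnet_workflow <- workflow() %>%
        add_recipe(recipe_average) %>%
        add_model(glmnet_reg_mod)
```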

3 Training Models

We’ll then train models for each outcome.

### outcome 1
## average
# glmnet
set.seed(1999)
average_glmnet_tune = average_glmnet_workflow %>%
        tune_grid(train_folds,
          grid = glmnet_grid,
          control = keep_pred,
          metrics = reg_metrics)

# xgbTree
set.seed(1999)
average_xgbTree_tune = average_xgbTree_workflow %>%
        tune_grid(train_folds,
          grid = xgbTree_grid,
          control = keep_pred,
          metrics = reg_metrics)

### outcome 2
## bayesaverage
# glmnet
set.seed(1999)
bayesaverage_glmnet_tune = bayesaverage_glmnet_workflow %>%
        tune_grid(train_folds,
          grid = glmnet_grid,
          control = keep_pred,
          metrics = reg_metrics)

# xgbTree
set.seed(1999)
bayesaverage_xgbTree_tune = bayesaverage_xgbTree_workflow %>%
        tune_grid(train_folds,
          grid = xgbTree_grid,
          control = keep_pred,
          metrics = reg_metrics)


### outcome 3
## usersrated
# glmnet
set.seed(1999)
usersrated_glmnet_tune = usersrated_glmnet_workflow %>%
        tune_grid(train_folds,
          grid = glmnet_grid,
          control = keep_pred,
          metrics = reg_metrics)

# xgbTree
set.seed(1999)
usersrated_xgbTree_tune = usersrated_xgbTree_workflow %>%
        tune_grid(train_folds,
          grid = xgbTree_grid,
          control = keep_pred,
          metrics = reg_metrics)

### outcome 4
## averageweight
# glmnet
set.seed(1999)
averageweight_glmnet_tune = averageweight_glmnet_workflow %>%
        tune_grid(train_folds,
          grid = glmnet_grid,
          control = keep_pred,
          metrics = reg_metrics)

# xgbTree
set.seed(1999)
averageweight_xgbTree_tune = averageweight_xgbTree_workflow %>%
        tune_grid(train_folds,
          grid = xgbTree_grid,
          control = keep_pred,
          metrics = reg_metrics)


# gather tuning results
tune_results = c(
        mget(ls(pattern = "glmnet_tune")),
        mget(ls(pattern = "xgbTree_tune"))
        )

# remove tuning results themselves
rm(list=ls(pattern="xgbTree_tune"))
rm(list =ls(pattern="glmnet_tune"))

For the linear models with Stan, we don’t need to tune any parameters; we’ll just fit them on our training set.

# register parallel
all_cores <- parallel::detectCores(logical = FALSE) - 2
cl <- parallel::makePSOCKcluster(all_cores)
doParallel::registerDoParallel(cl)

### fit stan models by outcome
# average
average_stan_lm_fit = average_stan_lm_workflow %>%
        fit(games_train)

# bayesaverage
bayesaverage_stan_lm_fit = bayesaverage_stan_lm_workflow %>%
        fit(games_train)

# usersrated
usersrated_stan_lm_fit = usersrated_stan_lm_workflow %>%
        fit(games_train)

# averageweight
averageweight_stan_lm_fit = averageweight_stan_lm_workflow %>%
        fit(games_train)

# remove parallel backend
registerDoSEQ()

For models with tuning parameters, we’ll refit on the training set using the tuning parameters which performed well in resampling.

### get tuning parameters
# glmnet
glmnet_best_pars =
        lapply(tune_results[grep("glmnet", names(tune_results))],
               show_best,
               metric = "rmse",
               n = 1)
# xgbTree
xgbTree_best_pars =
        lapply(tune_results[grep("xgbTree", names(tune_results))],
               show_best,
               metric = "rmse",
               n = 1)

## finalize workflows

### outcome 1
## average
# glmnet
set.seed(1999)
average_glmnet_last_fit= average_glmnet_workflow %>%
        finalize_workflow(glmnet_best_pars$average_glmnet_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

# xgbTree
set.seed(1999)
average_xgbTree_last_fit= average_xgbTree_workflow %>%
        finalize_workflow(xgbTree_best_pars$average_xgbTree_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

### outcome 2
## bayesaverage
# glmnet
set.seed(1999)
bayesaverage_glmnet_last_fit= bayesaverage_glmnet_workflow %>%
        finalize_workflow(glmnet_best_pars$bayesaverage_glmnet_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

# xgbTree
set.seed(1999)
bayesaverage_xgbTree_last_fit= bayesaverage_xgbTree_workflow %>%
        finalize_workflow(xgbTree_best_pars$bayesaverage_xgbTree_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

### outcome 3
## usersrated
# glmnet
set.seed(1999)
usersrated_glmnet_last_fit= usersrated_glmnet_workflow %>%
        finalize_workflow(glmnet_best_pars$usersrated_glmnet_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

# xgbTree
set.seed(1999)
usersrated_xgbTree_last_fit= usersrated_xgbTree_workflow %>%
        finalize_workflow(xgbTree_best_pars$usersrated_xgbTree_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

### outcome 4
## averageweight
# glmnet
set.seed(1999)
averageweight_glmnet_last_fit= averageweight_glmnet_workflow %>%
        finalize_workflow(glmnet_best_pars$averageweight_glmnet_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

# xgbTree
set.seed(1999)
averageweight_xgbTree_last_fit= averageweight_xgbTree_workflow %>%
        finalize_workflow(xgbTree_best_pars$averageweight_xgbTree_tune) %>%
        last_fit(validation_split, metrics = reg_metrics)

3.1 Training Results

Before we go to the validation set, we can take a look at the models from training, where we tuned the penalized linear models and gradient-boosted trees and fit linear models with Stan.

3.1.1 glmnet

For the penalized linear models (elastic net regularization), our lone tuning parameter is the penalty, or the amount of regularization used by the model. Looking at the results from resampling (examining RMSE for each fold), we achieved the best results with only a slight amount of regularization.

# collect metrics by fold from glmnet
lapply(tune_results[grep("glmnet", names(tune_results))], collect_metrics, summarize=F) %>% # get tuning results 
        rbindlist(., idcol =T) %>% # combine into dataframe
        filter(.metric == 'rmse') %>% 
        separate(".id", c("outcome", "method", "type")) %>%
        ggplot(., aes(x = penalty,
                      group = id,
                      color = id,
                      y = .estimate))+
        geom_line(lwd = 1.02)+
        facet_wrap(outcome~.,
                   ncol = 2,
                   scales = "free_y")+
        scale_color_viridis_d()+
        theme_bw(8)+
        ylab("rmse")

We used cross validation to estimate the appropriate values for this tuning parameter, but we can also look at the predictions from resampling to get a sense of how the model performed in predicting each outcome.

We’ll principally use the validation set to estimate performance on data not yet seen by the model, as using cross validation for both tuning and model assessment at the same time tends to lead to overly optimistic estimates of model performance. Still, we can open up the predictions and take a look.

# collect predictions with best tuning parameters
lapply(tune_results[grep("glmnet", names(tune_results))],collect_predictions) %>%
        rbindlist(., idcol=T, fill=T) %>%
        left_join(., glmnet_best_pars %>%
                          rbindlist(., idcol=T) %>%
                          mutate(best_tune = penalty) %>%
                          select(.id, best_tune),
                  by = ".id") %>%
        filter(penalty == best_tune) %>%
        select(id, .pred, .row, 
               average,
               averageweight,
               usersrated,
               bayesaverage) %>%
        gather("variable", "value",
               -id, -.pred, -.row) %>%
        left_join(., games_train %>%
                          mutate(.row = row_number()) %>%
                          select(.row, game_id, name),
                  by = c(".row")) %>%
        filter(!is.na(value)) %>%
        ggplot(., aes(x=.pred,
                      label = name,
                      y = value))+
        geom_point(alpha = 0.6)+
        geom_text(size=2,
                  check_overlap=T,
                  vjust = -1)+
        facet_wrap(variable ~.,
                   ncol =2,
                   scales="free")+
        theme_bw(8)+
        geom_abline(slope = 1,
                    intercept =0,
                    linetype = 'dashed')+
        geom_smooth(method = 'loess',
                    formula = 'y ~ x',
                    col = 'blue')+
        stat_cor(p.accuracy = .01,
                 col = 'blue')

We can also look at the features which had the largest partial effects by examining the absolute value of the coefficients (as we standardized our predictors during preprocessing). We’ll filter to the 40 features with the largest partial effects in each model.
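The coefficient extraction can be sketched like so, assuming the finalized fits from above (e.g. `average_glmnet_last_fit`); since predictors were standardized during preprocessing, ranking on the absolute coefficient is meaningful:

```r
library(tidymodels)

# pull the fitted glmnet model out of a last_fit result and rank terms
# by the absolute size of their coefficients at the selected penalty
average_glmnet_last_fit %>%
        extract_workflow() %>%
        extract_fit_parsnip() %>%
        tidy() %>%
        filter(term != "(Intercept)") %>%
        slice_max(abs(estimate), n = 40, with_ties = FALSE)
```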

3.1.2 xgbTree

For the gradient boosted trees, we can look at the results in resampling across a few of its tuning parameters.

# collect metrics by fold from xgbTree
lapply(tune_results[grep("xgbTree", names(tune_results))], 
       collect_metrics, summarize=F) %>% # get tuning results 
        rbindlist(., idcol =T) %>% # combine into dataframe
        filter(.metric == 'rmse') %>% 
        separate(".id", c("outcome", "method", "type")) %>%
        ggplot(., aes(x=sample_size,
                      group = id,
                      color =id,
                      shape = factor(min_n),
                      y = .estimate))+
        geom_line(lwd = 1.02)+
        facet_grid(outcome ~ min_n,
                   scales = "free_y")+
        theme_bw(8)+
        scale_color_viridis_d()

# collect predictions
lapply(tune_results[grep("xgbTree", names(tune_results))],
       collect_predictions) %>%
        rbindlist(., idcol=T, fill=T) %>%
        left_join(., xgbTree_best_pars %>%
                          rbindlist(., idcol=T) %>%
                          mutate(best_tune = 1),
                  by = c(".id",
                         "min_n",
                         "tree_depth",
                         "sample_size",
                         ".config")) %>%
        filter(best_tune==1) %>%
        select(id, .pred, .row, 
               average,
               averageweight,
               usersrated,
               bayesaverage) %>%
        gather("variable", "value",
               -id, -.pred, -.row) %>%
        left_join(.,
                  games_train %>%
                          mutate(.row = row_number()) %>%
                          select(game_id, name, .row),
                  by = c(".row")) %>%
        filter(!is.na(value)) %>%
        ggplot(., aes(x=.pred,
                      label = name,
                      y = value))+
        geom_point(alpha = 0.6)+
        geom_text(size=2,
                  check_overlap=T,
                  vjust = -1)+
        facet_wrap(variable ~.,
                   ncol =2,
                   scales="free")+
        theme_bw(8)+
        geom_abline(slope = 1,
                    intercept =0,
                    linetype = 'dashed')+
        geom_smooth(method = 'loess',
                    formula = 'y ~ x',
                    col = 'blue')+
        stat_cor(p.accuracy = .01,
                 col = 'blue')

While we don’t have coefficients from this model, we can extract measures of variable importance (via model-specific gain, permutation importance, or Shapley values).

# get fits
xgbTree_fits = mget(ls(pattern = "xgbTree_last_fit"))

# extract model-specific variable importance, using gain (the fractional contribution
# of each feature based on the total gain of that feature's splits)
lapply(xgbTree_fits,
                      function(x)
                      pluck(x, ".workflow", 1) %>%
                              extract_fit_parsnip() %>%
                              vip::vi_model(type="gain")) %>%
        rbindlist(., idcol = T) %>%
        separate(".id", c("outcome", "method", "fit", "type"), sep = "_") %>%
        group_by(outcome) %>%
        slice_max(order_by = Importance,
                  n = 40,
                  with_ties = F) %>%
        rename(term = Variable) %>%
        mutate(term = gsub("trunc_", "", term)) %>%
        mutate(term = tidy_name_func(term)) %>%
        ggplot(., aes(x=Importance,
                      y=reorder_within(term,
                                       Importance,
                                       outcome)))+
        geom_col(width=0.7)+
        facet_wrap(outcome ~.,
                   ncol =2,
                   scales="free_y")+
        scale_y_reordered()+
        theme_bw(8)+
        xlab("importance of feature")+
        ylab("feature")

3.1.3 Estimating the Geek Rating - Directly or Indirectly?

The geek rating (bayesaverage) is computed from a combination of the number of user ratings and the average rating. The approximate formula starts every game off with 1000 dummy votes at a 5.5 average, which the game’s real ratings then gradually shift. I previously trained models to directly predict the bayesaverage, as well as the average and user ratings. I am curious: do we do better in estimating the geek rating by predicting its components (average and user ratings) and then computing it, or by estimating it directly with a model?
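The approximation can be written as a small helper. (The exact constants BGG uses are not public; 1000 dummy votes at a 5.5 average mirrors the computation used for the indirect predictions later in this notebook.)

# approximate geek rating: a Bayesian average seeding each game with
# 1000 dummy ratings at a 5.5 average before mixing in real ratings
geek_rating <- function(average, usersrated,
                        prior_votes = 1000, prior_mean = 5.5) {
        ((prior_votes * prior_mean) + (usersrated * average)) /
                (prior_votes + usersrated)
}

geek_rating(average = 8.0, usersrated = 500) # 6.33: pulled toward the prior

A well-rated game with few ratings thus stays close to 5.5, which is why the geek rating rewards popularity as well as quality.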

One way we can get a sense of model fit is to plot simulated datasets from the model and compare them to the actual data. For the linear models fit with stan, it is easy to extract simulated datasets from the posterior. On the left, I display 100 simulated datasets from the stan model that estimated the geek rating directly. On the right, I display simulated datasets from the average and user rating stan models, from which I then computed the bayesaverage for every simulation.
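A sketch of pulling those simulations, assuming the stan fits are rstanarm-style objects and that user ratings were modeled on the log scale (object names are illustrative):

# Sketch: 100 simulated training datasets from each stan fit
sims_direct <- rstanarm::posterior_predict(bayesaverage_stan_lm_fit, draws = 100)
sims_avg    <- rstanarm::posterior_predict(average_stan_lm_fit, draws = 100)
sims_users  <- rstanarm::posterior_predict(usersrated_stan_lm_fit, draws = 100)

# indirect geek rating, computed per draw from the component simulations
sims_indirect <- ((1000 * 5.5) + exp(sims_users) * sims_avg) /
        (1000 + exp(sims_users))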

In comparing the simulated results for the training set from both approaches, the indirect approach resembles the actual distribution much more closely. Both still have a tendency to overestimate games on the higher end, but the indirect approach generally does a better job in both identifying games that were above average and games that did not move above the baseline geek rating. Will this hold up in the validation set? We shall see. We’ll compute our indirect bayesaverage using predictions from each of the methods and see how it fares.

4 Validating Models

Having trained the models, we can now move into assessing their performance on the validation set: games published in 2020.

It’s important to keep in mind that, lacking a true historical snapshot of BGG, we are essentially estimating where games will eventually converge on each outcome. Recently published games, especially those released during the pandemic, may not yet have reached stores and backers and so have not accumulated their eventual user ratings.

# get model workflows
trained_workflows = c(
        lapply(mget(ls(pattern = "_last_fit")),
                           function(x)
                                   pluck(x, ".workflow", 1)), # plucking tuned model workflows
        mget(ls(pattern = "_stan_lm_fit"))) # get stan fits


# now predict the validation set with these workflows
preds_validation = lapply(trained_workflows,
                          function(x)
                                  predict(x, new_data = games_validation)) %>%
        rbindlist(idcol = T) %>%
        as_tibble()

# calculate the indirect bayesaverage (usersrated predictions are
# exponentiated because the outcome was modeled on the log scale)
preds_validation_tidied = preds_validation %>%
        group_by(.id) %>%
        mutate(.row = row_number()) %>%
        left_join(., games_validation %>%
                          mutate(.row = row_number()) %>%
                          select(.row, game_id, name),
                  by = ".row") %>%
        mutate(.id = gsub("_last", "", gsub("_fit", "", .id))) %>%
        spread(.id, .pred) %>%
        mutate(bayesaverage_indirect.glmnet = (
                ((1000*5.5) + (exp(usersrated_glmnet)*average_glmnet)) /
                        (1000 + exp(usersrated_glmnet)))
               ) %>%
        mutate(bayesaverage_indirect.xgbTree = (
                ((1000*5.5) + (exp(usersrated_xgbTree)*average_xgbTree)) /
                        (1000 + exp(usersrated_xgbTree)))
               ) %>%
        mutate(bayesaverage_indirect.stan_lm = (
                ((1000*5.5) + (exp(usersrated_stan_lm)*average_stan_lm)) /
                        (1000 + exp(usersrated_stan_lm)))
               ) %>%
        mutate_if(is.numeric, round, 3) %>%
        arrange(desc(bayesaverage_indirect.xgbTree))

4.1 Plots

We’ll assess the models using a variety of metrics, but I always like to visually compare the predictions of each model to the actual values. I’ll separate out predictions for the geek rating so we can compare the indirect vs. the direct approach.

For the number of user ratings and the geek rating in particular, the models pretty consistently overpredict games in the validation set. This relates to the point above: we’re in a bit of a waiting game to see whether these games will eventually move.

4.2 Tables

And as much as I like scatter plots, if we simply want to understand the aggregate performance of each model for each outcome, a table does the job.
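A sketch of how one cell of such a table can be computed with yardstick, assuming the tidied predictions can be joined back to the actual outcomes in the validation set (column names are illustrative):

# Sketch: aggregate validation metrics for one model/outcome pair
eval_metrics <- yardstick::metric_set(yardstick::rmse,
                                      yardstick::mae,
                                      yardstick::rsq)

preds_validation_tidied %>%
        left_join(games_validation %>%
                          select(game_id, average),
                  by = "game_id") %>%
        eval_metrics(truth = average,
                     estimate = average_xgbTree)

Repeating this over every model and outcome (e.g., via lapply) yields the full comparison table.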

Generally speaking, the boosted trees tend to exhibit the strongest performance regardless of outcome and evaluation metric, though the improvement over the linear models is not dramatic.

4.3 Comparing Models

If we plot the predictions from the models for a sample of games, we can get a sense of where they agree and where they consistently miss.

4.4 Simulation

As before, we can take a look at simulated datasets from the linear model fit with stan vs the actual.

We can also examine the uncertainty around our prediction with the linear model via simulation. We can plot the 80% prediction interval for the geek rating for a sample of games compared to where they currently sit.
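A sketch of computing those intervals, assuming the geek rating stan fit is an rstanarm-style object (name illustrative):

# Sketch: 80% prediction intervals for the geek rating on the validation set
intervals_80 <- rstanarm::predictive_interval(bayesaverage_stan_lm_fit,
                                              newdata = games_validation,
                                              prob = 0.8)
head(intervals_80) # one "10%"/"90%" bound pair per game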

5 Predicting Upcoming Games

With all of this behind us, we can refit the models to the training and validation set in order to predict upcoming games.

We’ll then save these models so that we can reuse them easily, and then use them to predict upcoming games.
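A sketch of the refit-and-cache step, assuming the tuned workflows extracted earlier and that both data splits share the same columns (paths are illustrative; the stan fits may need their own refitting step):

# refit each tuned workflow on training + validation (games through 2020)
games_full <- bind_rows(games_train, games_validation)

final_fits <- lapply(trained_workflows[grep("glmnet|xgbTree",
                                            names(trained_workflows))],
                     function(wf) fit(wf, data = games_full))

# cache to disk so we can reload without retraining
saveRDS(final_fits, "models/final_fits.rds") # illustrative path
# final_fits <- readRDS("models/final_fits.rds")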

5.1 Highest Expected Average Rating

What upcoming games are expected to have the highest average rating?

5.2 Highest Expected User Ratings

What upcoming games are expected to be the most popular (i.e., receive the most ratings)?

5.3 Highest Expected Geek Rating

What upcoming games are expected to have the highest geek rating (user ratings + average)?